In [1]:
import sys, subprocess
try:
    from sklearn.datasets import fetch_california_housing
    from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
except ImportError:
    # the PyPI package is named scikit-learn, not sklearn; pip.main was removed in pip 10
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--user', 'scikit-learn'])

The code below does not work on Azure because the download fails with a 403 error:

boston = load_boston()
california = fetch_california_housing()

On the local machine, take a different route: download the Boston and California data locally, save them as CSV files, and upload those to Azure.

dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target
dataset.to_csv('boston.csv')
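The same pattern also works for the California data (a sketch, assuming fetch_california_housing succeeded locally; the california.csv filename is my choice):

In [ ]:
california_df = pd.DataFrame(california.data, columns=california.feature_names)
california_df['target'] = california.target
california_df.to_csv('california.csv')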


In [2]:
from azureml import Workspace
ws = Workspace(
    workspace_id='3c64d445b4c840dca9683dd47522eba3',
    authorization_token='JaC5E2q5FouX14JhvCmcvmzagqV63q0oVIbu2jblLBdQ5e5wf/Y24Ed6uXLvbSUgbiao5iF85C3uufYKQgXoNw==',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['boston.csv']
df = ds.to_dataframe()
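Outside Azure ML Studio, the same file can of course be read directly with pandas (a sketch, assuming boston.csv sits next to the notebook; index_col=0 consumes the index column written by to_csv):

In [ ]:
df = pd.read_csv('boston.csv', index_col=0)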

In [3]:
df.head()


Out[3]:
   Unnamed: 0     CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  target
0           0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98    24.0
1           1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14    21.6
2           2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03    34.7
3           3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94    33.4
4           4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33    36.2

In [4]:
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
import matplotlib as mpl

In [5]:
%matplotlib inline  
# If you are using IPython, this will make the images available in the Notebook

The following code plots the normal density for several combinations of mean and standard deviation.


In [6]:
from scipy.stats import norm  # matplotlib.mlab.normpdf was removed in matplotlib 3.1
x = np.linspace(-4, 4, 100)
for mean, std in [(0, 0.7), (1, 1.5), (-2, 0.5)]:
    # norm.pdf takes the standard deviation, not the variance
    plt.plot(x, norm.pdf(x, mean, std))
plt.show()
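As a quick sanity check (my addition, reusing the parameters above), a density should integrate to roughly 1, and the plotted range [-4, 4] already covers nearly all the probability mass:

In [ ]:
print(norm.cdf(4, 0, 0.7) - norm.cdf(-4, 0, 0.7))  # ≈ 1.0: nearly all the mass lies in [-4, 4]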



In [7]:
y = norm.pdf(x, 0, 1)

In [8]:
type(y)


Out[8]:
numpy.ndarray

Two ways of computing the mean. SSE is the sum of squared errors. The histogram below shows, in its first bin, that roughly 350 observations have a squared error between 0 and 100.


In [9]:
print(df['target'].mean())
print(np.mean(df['target']))
mean_expected_value=np.mean(df['target'])


22.532806324110698
22.532806324110698

In [10]:
df.loc[:, 'target'].mean()  # .ix was removed from pandas; use .loc instead


Out[10]:
22.532806324110698

In [11]:
Square_errors = pd.Series(mean_expected_value - df['target'])**2
SSE = np.sum(Square_errors)
print('Sum of Squared Errors (SSE): %f' % SSE)


Sum of Squared Errors (SSE): 42716.295415
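Since these are squared deviations from the mean, the SSE equals n times the population variance of the target; a quick check (my addition, assuming df and SSE from the cells above):

In [ ]:
print(np.allclose(SSE, df['target'].var(ddof=0) * len(df)))  # True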

In [12]:
density_plot = Square_errors.plot(kind='hist')


Standardization. After standardizing, a variable has mean 0 and variance 1.


In [13]:
def standardize(x):
    return (x-np.mean(x))/np.std(x)
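This function matches scipy's built-in z-score (a quick check, my addition; scipy.stats.zscore also uses the population standard deviation, ddof=0, by default):

In [ ]:
from scipy.stats import zscore
print(np.allclose(standardize(df['target']), zscore(df['target'])))  # True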

In [15]:
standardize_target=standardize(df['target'])

In [16]:
standardize_target.std()


Out[16]:
1.0009896093465709

In [17]:
standardize_target.mean()


Out[17]:
-3.020859802972165e-15

This function computes the covariance.


In [21]:
def covariance(variable_1, variable_2, bias=0):
    # bias=0 divides by n; bias=1 divides by n-1
    observations = float(len(variable_1))
    return np.sum((variable_1 - np.mean(variable_1)) * (variable_2 - np.mean(variable_2))) / (observations - min(bias, 1))
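The result can be compared against numpy's estimator (a sketch, my addition; bias=0 above divides by n, which corresponds to np.cov(..., bias=True)):

In [ ]:
print(covariance(df['RM'], df['target']))
print(np.cov(df['RM'], df['target'], bias=True)[0, 1])  # should match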

This function computes the correlation; the only difference is that the inputs are standardized first.


In [22]:
def correlation(var1, var2, bias=0):
    return covariance(standardize(var1), standardize(var2), bias)

In [20]:
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path
print('Our correlation estimation: %0.5f' % correlation(df['RM'], df['target']))
print('Correlation from Scipy pearsonr estimation: %0.5f' % pearsonr(df['RM'], df['target'])[0])


Our correlation estimation: 0.69536
Correlation from Scipy pearsonr estimation: 0.69536

In [23]:
print(pearsonr(df['RM'],df['target']))


(0.69535994707153925, 2.4872288710082951e-74)

Let's graph what happens when we correlate two variables. A scatterplot makes the two variables involved easy to visualize: it is a graph where the values of the two variables are treated as Cartesian coordinates, so every (x, y) pair is drawn as a point in the graph.


In [28]:
x_range = [df['RM'].min(),df['RM'].max()]  
y_range = [df['target'].min(),df['target'].max()]  
scatter_plot = df.plot(kind='scatter', x='RM', y='target',xlim=x_range, ylim=y_range)
meanY = scatter_plot.plot(x_range, [df['target'].mean(),df['target'].mean()], '--' , color='red', linewidth=1) 
meanX = scatter_plot.plot([df['RM'].mean(),df['RM'].mean()], y_range, '--', color='red', linewidth=1)


The scatterplot also plots the average value of both the target and the predictor variable as dashed lines, dividing the plot into four quadrants. Comparing it with the covariance and correlation formulas above, we can see why the correlation value is high (about 0.7): in the bottom-right and top-left quadrants there are only a few mismatching points, where one of the variables is above its average while the other is below its own. A perfect match (a correlation of 1 or -1) is possible only when the points lie on a straight line (all points then fall in the upper-right and lower-left quadrants). Correlation is therefore a measure of linear association, of how close your points are to a straight line. Ideally, having all your points on a single line favors a perfect mapping from predictor variable to target.
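To make the quadrant argument concrete, we can count how many points match (both variables above, or both below, their means) versus mismatch (a sketch, my addition, assuming df from the cells above):

In [ ]:
above_x = df['RM'] > df['RM'].mean()
above_y = df['target'] > df['target'].mean()
print('matching points (upper-right and lower-left):', int((above_x == above_y).sum()))
print('mismatching points (upper-left and lower-right):', int((above_x != above_y).sum()))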